Abstract


Background: Despite the passing of more than a year since the first outbreak of Severe Acute 
Respiratory Syndrome (SARS), efficient counter-measures are still few and many believe that 
reappearance of SARS, or a similar disease caused by a coronavirus, is not unlikely. For other virus 
families like the picornaviruses it is known that pathology is related to proteolytic cleavage of host 
proteins by viral proteinases. Furthermore, several studies indicate that virus proliferation can be 
arrested using specific proteinase inhibitors supporting the belief that proteinases are indeed 
important during infection. Prompted by this, we set out to analyse and predict cleavage by the 
coronavirus main proteinase using computational methods. 


Results: We retrieved sequence data on seven fully sequenced coronaviruses and identified the 
main 3CL proteinase cleavage sites in polyproteins using alignments. A neural network was trained 
to recognise the cleavage sites in the genomes obtaining a sensitivity of 87.0% and a specificity of 
99.0%. Several proteins known to be cleaved by other viruses were submitted to prediction as well 
as proteins suspected relevant in coronavirus pathology. Cleavage sites were predicted in proteins 
such as the cystic fibrosis transmembrane conductance regulator (CFTR), transcription factors 
CREB-RP and OCT-1, and components of the ubiquitin pathway.


Conclusions: Our prediction method NetCorona predicts coronavirus cleavage sites with high 
specificity and several potential cleavage candidates were identified which might be important to 
elucidate coronavirus pathology. Furthermore, the method might assist in design of proteinase 
inhibitors for treatment of SARS and possible future diseases caused by coronaviruses. It is made 
available for public use at our website: http://www.cbs.dtu.dk/services/NetCorona/. 


Background
In the spring of 2003, the Severe Acute Respiratory Syndrome
(SARS) caused numerous fatalities, particularly in
Southeast Asia, and gravely affected the global economy.
The causative agent was shown to be a human coronavirus
[1], a virus type which normally causes mild cold symptoms
in humans. The abrupt appearance raises concern of
another outbreak of an epidemic of SARS virus or similar
strains in the future.


Coronaviruses are found in different species ranging from 
chicken to cattle and humans. Currently, seven coronavi- 
rus genomes, including SARS coronavirus (CoV), have 
been fully sequenced and cluster into four main groups, of 
which SARS-CoV occupies its own [2,3]. Polyproteins
encoded by the coronavirus RNA are processed by viral
proteinases yielding mature proteins. The main proteinase
3CLpro performs at least eleven proteolytic cleavages
within a single viral polyprotein [4,5]. Viral polyprotein
processing is a common theme in viral molecular biology,
e.g. as seen in picornaviruses and retroviruses like HIV.
Therefore, essential viral proteinases have been suggested 
as potential targets for specific therapeutic approaches, 
e.g. by development of specific proteinase inhibitors [6- 
8]. 


In the case of picornaviruses, virus-encoded proteinases 
are able to cleave specific cellular targets and thereby 
severely inhibit the cellular translational machinery (the 
"host cell shut-off" response) while still allowing for high 
translational activity of viral mRNA [9]. Earlier, we devel- 
oped a computational approach for predicting potential 
cleavage sites of picornavirus proteinases 2A and 3C [10]. 
Badorff et al. successfully used this cleavage predictor to 
identify the cellular target dystrophin, which they experi- 
mentally showed to be cleaved both in vitro and in vivo 
[11]. However, preliminary studies revealed that this 
model is not compatible with coronavirus cleavage sites. 
The general approach is still valid though, and we decided 
to apply this method to the problem of predicting the
3CLpro proteinase cleavage sites and identifying potential
host cell target proteins. We propose that a deeper under- 
standing of coronavirus proteinase function and substrate 
specificity may benefit further research by: i) increasing 
the understanding of substrate specificity determinants 
which may direct studies focusing on the development of 
specific proteinase inhibitors and ii) providing a method 
for screening cellular target proteins for potential corona- 
virus proteinase cleavage sites. 


In this paper, we describe the development of a computa- 
tional prediction method using artificial neural networks 
for predicting coronavirus 3CLpro proteinase cleavage sites.
The method is based on known cleavage sites in seven
members of the coronavirus family as the cleavage sites
are believed to be sufficiently conserved among family
members. This notion is supported by the fact that the
SARS 3CLpro proteinase has recently been shown capable
of catalysing the cleavage of peptide fragments from other 
coronaviruses at the expected cleavage sites [12]. 


We discuss potential targets of 3CLpro proteinase, e.g. the
cystic fibrosis transmembrane conductance regulator 
(CFTR) and translational and transcriptional factors, 
which may be involved in the molecular pathology of 
coronaviruses in general and SARS virus in particular.


Results

Analysis of the proteinase cleavage site 

The 77 annotated coronavirus polyprotein main protein- 
ase cleavage sites were aligned without gaps by constrain- 
ing the P1 position. Every site had a glutamine (Q) in 
position P1 (the position just before the cleavage site; the 
positions are named as suggested by Berger and Schechter 
[13] with P1, P2, ... etc., N-terminal to the cleavage site 
and P1', P2', ... etc., C-terminal to the cleavage site). From 
the sequence logo (Figure 1) a very strong consensus is 
evident around the cleavage site. As discussed by others 
[14,15], the coronavirus 3C-like proteinase shares many 
traits with its picornavirus 3C proteinase counterpart, 
hence the name. This is reflected in the cleavage site logo 
although differences between the two are also apparent. 
Positions P1', P1, and P4 have similar amino acid distri- 
bution in the 3C and 3CL proteinase cleavage sites. On the 
other hand, the coronavirus proteinase has a strong pref- 
erence for leucine at position P2 while this position is rel- 
atively non-conserved among picornavirus proteinase 
cleavage sites [10]. A recently published study of the crystal
structure of 3CLpro from the 229E strain of human coronaviruses
indicates that residues at positions P5 to P3
form an anti-parallel β-sheet with part of the proteinase,
signifying their importance in cleavage site recognition [7].


It is clear from the above that a simple, position specific 
consensus sequence is difficult to define. With the present 
data set from seven different coronaviruses it is possible to 
classify correctly 60 (78%) of the 77 cleavage sites by 
matching an 'LQ' consensus pattern. However, an additional
196 sites in the viral polyproteins are incorrectly
classified as cleavage sites, being random occurrences of
this pair of amino acids. Classification is improved by
using the consensus pattern 'LQ[S/A]', meaning
Leu-Gln-(Ser OR Ala), but it is still far from being a useful classifier.
The number of false positives is now down to 36, but
at the same time only 48 (62%) of the correct cleavage
sites are detected. As the pattern becomes more sophisticated,
specificity increases (reducing the number of false
positives) but at the same time sensitivity drops dramatically
(i.e. fewer of the true sites are detected).
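The consensus matching described above can be illustrated with a short sketch. This is not the code used in the study; the function name `consensus_sites` and the choice of regular expressions are assumptions made for illustration. It reports the 1-based position of the P1 glutamine for each match.

```python
import re

# Consensus patterns from the text, written as regular expressions.
# The lookahead in the second pattern requires Ser or Ala at P1'
# without consuming it (an illustrative choice, not the authors' code).
PATTERNS = {
    "LQ": re.compile(r"LQ"),
    "LQ[S/A]": re.compile(r"LQ(?=[SA])"),
}


def consensus_sites(sequence, pattern):
    """Return 1-based positions of the P1 glutamine for every match."""
    # m.start() is the 0-based index of Leu (P2); Gln (P1) follows it.
    return [m.start() + 2 for m in pattern.finditer(sequence)]


# Two known SARS-CoV sites taken from Table 2 of this paper.
print(consensus_sites("VATLQAENV", PATTERNS["LQ[S/A]"]))  # matched
print(consensus_sites("GVTFQGKFK", PATTERNS["LQ"]))       # missed: P2 is Phe
```

The second call shows why such patterns lose sensitivity: a genuine cleavage site with phenylalanine at P2 is never matched by 'LQ'.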


Neural network training and performance 

To overcome the limitations of simple consensus patterns, 
we trained an artificial neural network to identify the 
cleavage sites. The best model was obtained using a three- 
layered neural network with two hidden neurons and a 
sequence window encompassing nine amino acids cen- 
tered on the P1 position, thus encompassing P5-P4'. The 
network evaluates and assigns a score between 0 and 1 to 
every glutamine to which it is presented, where a score 
above 0.5 is considered a positive answer (i.e. a cleavage 
site is predicted). This model was able to classify correctly
67 of 77 known cleavage sites (87.0%) and 1,358 of 1,372
(99.0%) sites assumed not to be cleaved by the proteinase
when testing on independent sites not included in
training. The neural network method could thus identify
many more of the positive sites with fewer false positives
than simple consensus-type methods, thereby increasing
the classification performance. The Matthews correlation
coefficient reached 0.84 for the artificial neural network
compared to 0.37, 0.53 and 0.51 for increasingly complex
consensus patterns ('LQ', 'LQ[S/A]', '[T/S/A]X[L/F]Q[S/A/G]',
respectively) (Figure 2).


Figure 1
Logo plot of a multiple alignment of 77 coronavirus
cleavage sites. The height of the letters reflects the Shannon
information at individual positions (see Methods section
for detailed information).


To evaluate the predictive power of the neural network,
we performed a basic Bayesian analysis of the data set test
results. The scoring range from 0 to 1 was divided into ten
bins and the posterior probability of a positive prediction
(a prediction indicating a cleavage) being true was calculated
and plotted (Figure 3). The posterior probability in
the range 0.5 to 0.8 cannot be determined accurately since
relatively few examples score in this interval; only 3% of
the test set (both positive and negative examples) scores
between 0.4 and 0.8. However, results indicate that prediction
scores can be classified into three categories: those
that fall below 0.5 are most likely not cleaved, those that
fall between 0.5 and 0.8 are possibly cleaved, and those
above 0.8 are most likely cleaved if available to the
proteinase.
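The three score categories can be written as a small helper. This is a minimal sketch; the function name `cleavage_category` is hypothetical, and assigning scores of exactly 0.5 and 0.8 to the middle category is an assumption, since the text does not say how boundary values are handled.

```python
def cleavage_category(score):
    """Map a prediction score in [0, 1] to one of the three categories."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score < 0.5:
        return "most likely not cleaved"
    if score <= 0.8:  # boundary handling is an assumption
        return "possibly cleaved"
    return "most likely cleaved if accessible"


# Scores reported in this paper for MAP4_HUMAN and AT6B_HUMAN.
print(cleavage_category(0.519))
print(cleavage_category(0.916))
```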


Analysis of selected human proteins 

As mentioned above, there are several experimentally ver- 
ified examples of host cell protein cleavage by virus pro- 
teinases. Thus, both these and other non-coronavirus 
proteins from Swiss-Prot [16] release 41.0 were examined for
potential cleavage sites. In total, three groups of proteins
were examined: i) proteins known to be cleaved by other
viruses, ii) proteins which could be targets when considering
the pathology of coronaviruses, and iii) proteins related to
the expected immune response to a viral infection.


Eukaryotic translation initiation factor 4 gamma
(IF4G_HUMAN) has a potential cleavage site after
Gln838 (0.822), but also at two other positions although
with lower cleavage scores. Cleavage of this protein may
lead to host cell shut-off in a similar way to what has been
described for picornavirus 2A proteinase [17].


Two subunits of the RNA polymerase III are predicted targets
of the coronavirus proteinase 3CLpro. RNA polymerase
(RPC1_HUMAN) has a predicted cleavage site after
Gln195 with a score (0.765) well above the 0.5 cut-off. The
protein is the second largest subunit of the RNA
polymerase III complex and if this protein is indeed a cellular
proteinase target it might cause disruption of the
RNA polymerase III complex upon infection with a coronavirus.
A similar disruption would be expected in case of
a cleavage of the largest subunit of the complex
(RPA1_HUMAN) which also has a predicted cleavage site
(at position 329, score 0.704). This agrees with findings that
poliovirus disrupts RNA polymerase III function,
although this occurs through cleavage of transcription factor
IIIC and not the polymerase subunits themselves [18-20].
Several well-known transcription factors contain
potential cleavage sites. The highest scoring is CREB-RP
(AT6B_HUMAN) with a predicted cleavage site at Gln358
(0.916) close to the DNA binding leucine zipper motif.
This is in agreement with findings from picornavirus 3Cpro
proteinase although at a different position in the sequence
[21]. OCT-1 (PO21_HUMAN) is also predicted to be
cleaved by the 3CLpro proteinase with high confidence
(0.874) following Gln62, again corresponding to experimental
evidence from picornavirus [22]. Several subunits
of the transcription initiation factor TFIID, which is a verified
target in poliovirus infections [23], have predicted
cleavage sites: the 250 kDa subunit (T2D1_HUMAN), the
135 kDa subunit (T2D3_HUMAN), and the 105 kDa subunit
(T2DT_HUMAN).


The tumor-suppressor protein P53 is known to be cleaved
by picornavirus 3Cpro proteinase [24] but this protein is
not predicted to contain any coronavirus 3CLpro proteinase
cleavage sites. However, P53-binding protein 1
(P531_HUMAN) and P53-binding protein 2
(P532_HUMAN), which stimulate p53-mediated transcriptional
activation [25], have several potential cleavage
sites.


Another known target for viral infections is the microtubule-associated
protein 4 (MAP-4) which is cleavable in
HeLa cells by the poliovirus 3Cpro proteinase [26,27].
MAP-4 (MAP4_HUMAN) might also be cleavable by
3CLpro, albeit with a low score (after Gln1005 with a score
of 0.519), and furthermore microtubule-associated protein
RP/EB members 1 and 3 (MAE1_HUMAN and
MAE3_HUMAN) have sites which obtain scores above
0.5. The position of the possible cleavage site in MAP-4 is
different from that observed with poliovirus 3Cpro, reflecting
the different specificity of this proteinase.


Figure 2
Method performance comparison. Using consensus patterns or neural network (NN) to identify cleavage sites. Green
bars are percentage of true positives, red bars are percentage of true negatives and blue bars are Matthews correlation
coefficients multiplied by 100.


Lung-related proteins were examined as early symptoms
of SARS could indicate a relation. The cystic fibrosis transmembrane
conductance regulator (CFTR_HUMAN) is an
ATP-dependent chloride channel. It has a predicted cleavage
site with a high score (0.842) following Gln762 in the
human sequence. This part of the membrane protein is
cytoplasmic and contains several phosphorylation sites
(residues 660-813), indicating an accessible region.


The epithelial sodium channels play an important role in
lung liquid homeostasis [28] and the amiloride-sensitive
sodium channel δ-subunit (SCAD_HUMAN) has a predicted
cleavage site in the cytoplasmic C-terminus (after
residue 22) scoring 0.828. A number of proteins involved
in the ubiquitin pathway, which targets proteins to the
proteasome, a necessary step to generate an immune
response, have predicted cleavage sites (Swiss-Prot entries
UBP1_HUMAN, SOC6_HUMAN, UBPD_HUMAN,
UBP4_HUMAN, UBP5_HUMAN, UBPQ_HUMAN,
FAFY_HUMAN, FAFX_HUMAN). Cleavage of one or
more of these proteins may lead to reduced presentation
of viral peptides to cytotoxic T lymphocytes, thereby
inhibiting the cellular immune response. IRAK-1
(IRA1_HUMAN), which is involved in IL-1 induced activation
of cells, has a predicted cleavage site after Gln457
scoring 0.859.


Figure 3
Reliability analysis of data set test results. Scoring range (0 to 1) was divided into ten bins. The fraction of negative examples
in each bin is illustrated with red bars, the fraction of positive examples is illustrated with green bars, blue bars are posterior
probabilities of a true cleavage prediction (see Methods section for detailed information).


Interferon-induced protein 6-16 precursor
(INI2_HUMAN) is a membrane protein and was predicted
to possess a cleavage site following Gln97 (0.890)
which is located in the cytoplasmic part of the mature protein.
Protein 6-16 has been shown to enhance interferon-α
antiviral efficacy [29]. Interferon-α, -β, and -γ are known
to be involved in antiviral defence and have been
employed for treatment of SARS [30], but the interferons
themselves do not seem to possess cleavage sites.


We have listed the human proteins analysed in this work 
in a table (Table 1). 


Discussion 

We have developed a neural network capable of identifying
the cleavage site of the coronavirus proteinase 3CLpro
and used this model to predict potential cleavage sites in
host cell proteins. The predictor is highly specific, which
means that few false positives are expected; on
independent test sets we observed a false positive rate
around 1%. The optimal network window size of nine residues
agrees well with available structural information
about the proteinase from human coronavirus 229E
which indicates that the active site makes contact with at
least four residues N-terminal to the glutamine [7].


The ten sites known to be cleaved that the neural network
failed to recognise are not dramatically different
from the remainder of the sites (Table 2). We therefore do
not suspect these to be sites of a different hitherto
unknown proteinase, but it would be interesting to see if
the lower prediction score reflects a lower cleavage efficiency
in vivo. Of the fourteen negative examples wrongly
predicted as cleavable (Table 3), the highest scoring were
examined more closely. The selected examples all show
some resemblance to real cleavage sites but also some
resemblance to negative examples which are not predicted
as cleavable. They may represent sites in-between which
are cleavable to a certain extent but are shielded from
cleavage due to conformational issues.


Table 1: Selected potential cleavage sites in human proteins from the Swiss-Prot database examined in this work. Columns represent
Swiss-Prot identifier, cellular localisation of target protein (Cyt: cytoplasmic, Nuc: nuclear, Mem: membrane associated), predicted
cleavage site position of P1 in the target protein, and cleavage site score. The last column lists the cleavage site in the sequence;
cleavage is predicted between the central glutamine residue (Q) and the following amino acid residue. Sorted by prediction score.

Swiss-Prot ID   Loc   Position   Score   Sequence
AT6B_HUMAN      Nuc   358        0.916   EARLQAVLAD
INI2_HUMAN      Mem   97         0.890   VATLQSLGAG
PO21_HUMAN      Nuc   62         0.874   GTSLQAAAQS
IRA1_HUMAN      Cyt   457        0.859   QSTLQAGLAA
CFTR_HUMAN      Mem   762        0.842   GPTLQARRRQ
SCAD_HUMAN      Mem   22         0.828   GSHLQAAAQT
P532_HUMAN      Nuc   308        0.782   ASVPQSTGNA
RPC1_HUMAN      Cyt   195        0.765   SNFLQSFETA
P531_HUMAN      Nuc   196        0.738   KEQLQSVTTN
T2D1_HUMAN      Nuc   TAI        0.730   GQLLQAFENN
P532_HUMAN      Nuc   197        0.725   KAALQQKENL
RPA1_HUMAN      Cyt   329        0.704   TVNLQAVMKD
CFTR_HUMAN      Mem   958        0.693   HSVLQAPMST
MAE1_HUMAN      Cyt   64         0.661   KVKFQAKLEH
MAE3_HUMAN      Cyt   64         0.661   KVKFQAKLEH
P531_HUMAN      Nuc   410        0.660   QKKLQSGEPV
CFTR_HUMAN      Mem   890        0.654   NTPLQDKGNS
P532_HUMAN      Nuc   722        0.624   SPNLQNNPEE
T2DT_HUMAN      Nuc   133        0.619   PSSVQSVAVP
T2D3_HUMAN      Nuc   610        0.570   SSGKQSTETA
MAP4_HUMAN      Cyt   1005       0.519   YSHIQSKCGS


Table 2: Known main proteinase cleavage sites in coronavirus polyproteins used in this study, which were missed by the neural network
during cross-validation. Position refers to position in the viral polyprotein. The last column lists the cleavage site in the sequence;
cleavage occurs between the central glutamine residue (Q) and the following amino acid residue.

Accession   Position   Virus   Sequence
NC_001451   3928       AIBV    KSSVQSVAG
NC_001846   3923       MHV     VSQIQSRLT
NC_001846   5984       MHV     NPRLQCTTN
NC_002306   5527       TGV     KIGLQAKPE
NC_003045   5900       BCoV    ETRVQCSTN
NC_003436   3299       PEDV    GVNLQGGYV
NC_003436   6141       PEDV    SNNLQGLEN
NC_004718   3546       SARS    GVTFQGKFK
NC_004718   4369       SARS    EPLMQSADA
NC_004718   5902       SARS    VATLQAENV


Predicted sites, even those with high scores, that are inaccessible
to the proteinase (such as extracellular domains,
transmembrane domains, or buried domains in globular
proteins) should be disregarded, as accessibility information
is not available to the neural network. Cleavage sites
probably exist that are not cleaved because they are not
exposed to the solvent sufficiently for the proteinase to
work.


Others have attempted recognising the cleavage sites of
the 3CL proteinase as a component of a coronavirus gene
prediction server using different methods [31]. As the goal
was different, that predictor is not publicly available and
no performance values have been published.


Table 3: Negative examples predicted to be cleaved by the neural network during cross-validation. Position refers to position in the viral
polyprotein. The last column lists the cleavage site in the sequence; cleavage is predicted between the central glutamine residue (Q)
and the following amino acid residue.

Accession   Position   Virus       Sequence
NC_001846   3607       MHV         HSGFQGKQI
NC_001846   6613       MHV         YTDLQCIES
NC_002306   1457       TGV         ETSLQCLLK
NC_002306   5747       TGV         YSSSQSVYA
NC_002306   698        TGV         ETNIQAIKN
NC_002306   85         TGV         SVMLQGFIV
NC_002645   1169       HCoV-229E   IRQLQGTII
NC_002645   2659       HCoV-229E   YSSIQANAY
NC_002645   322        HCoV-229E   VIALQSVDC
NC_003045   1364       BCoV        DARTQGKQS
NC_003045   1498       BCoV        RTFVQSNVD
NC_003045   2713       BCoV        SSDFQHKLK
NC_003045   311        BCoV        VMRLQSAST
NC_003436   1751       PEDV        SAGLQAMWE


Conclusions 

Our method can be employed by researchers suspecting a 
possible viral proteinase cleavage but may also prove use- 
ful for researchers working with coronavirus function. 
Finally, the method might facilitate proteinase blocking 
based drug discovery by providing hints about proteinase 
affinity to various non-cleavable peptide ligands, which is 
a possible strategy for drug development [7,32]. 


Methods 

Data Set Preparation 

Seven full-length coronavirus genomes were retrieved 
from the GenBank database [33] with the following acces- 
sion numbers: NC_001451 (Avian infectious bronchitis 
virus, AIBV), NC_001846 (Murine hepatitis virus, MHV), 
NC_002645 (Human coronavirus 229E, HCoV-229E), 
NC_003436 (Porcine epidemic diarrhea virus, PEDV), 
NC_003045 (Bovine coronavirus, BCoV), NC_002306 
(Transmissible gastroenteritis virus, TGV), and the TOR2 
strain of SARS NC_004718. Deduced polyprotein 
sequences were aligned and cleavage sites identified from 
the annotation in NC_004718. Each sequence contained
eleven 3CLpro proteinase cleavage sites, thus a total of 77
of these sites were identified. For training a neural network
classifier, a number of negative examples (presumed
non-cleavage sites) are required. For this purpose, all 
other glutamines in the viral polyproteins were treated as 
non-cleavable sites. 


Three test sets were created for three-fold cross-validation,
and the training set for each was created by combining the
two other test sets. Every test set thus contained 483 examples
of which 25 or 26 were positive examples. All
results reported are combined values of the three test
sets, which are run individually with three separate neural
networks to avoid testing on sequences included in training
sets.
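The partitioning can be sketched as follows. This is an illustration under assumptions, not the procedure used in the study: the function names are hypothetical, and examples are simply dealt round-robin (positives first, then negatives), which happens to reproduce the reported set sizes for 77 positive and 1,372 negative examples.

```python
def three_fold_partition(positives, negatives, n_folds=3):
    """Deal examples round-robin into n_folds test sets.

    Positives are dealt first so that they are spread evenly;
    negatives continue the same rotation (an illustrative choice).
    """
    folds = [[] for _ in range(n_folds)]
    for j, example in enumerate(list(positives) + list(negatives)):
        folds[j % n_folds].append(example)
    return folds


def train_test_pairs(folds):
    """For each test set, the training set is the union of the others."""
    pairs = []
    for k, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        pairs.append((train, test))
    return pairs


# Placeholder examples standing in for the 77 cleavage sites and the
# 1,372 other glutamines in the viral polyproteins.
positives = [("pos", i) for i in range(77)]
negatives = [("neg", i) for i in range(1372)]
folds = three_fold_partition(positives, negatives)
print([len(fold) for fold in folds])
```

Each of the three networks is then trained on one training set and evaluated only on the corresponding held-out test set.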


Sequence logos 

Amino acid conservation in multiple sequence alignments
may be visualised using sequence logos. The height
of the amino acid one-letter abbreviations reflects the
Shannon information content [34] in units of bits at that
specific position in the multiple sequence alignment [35].
The basic idea behind the visualisation technique is that
the height of each letter in a given position reflects its
probability p_k(i). The total height of the column reflects
the total information content D(i) at that specific position
in the alignment, given by (for proteins):

\[ D(i) = \log_2 20 + \sum_{k=1}^{20} p_k(i) \log_2 p_k(i) \]

Very conserved positions will then get tall columns with
the height of individual residue symbols reflecting the
amino acid distribution.
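The information content per alignment column can be computed directly from the formula. The sketch below is illustrative only; the function names are hypothetical, and scaling each letter to p_k(i) times D(i) follows the usual sequence logo convention, which is assumed rather than stated in the text. No small-sample correction is applied.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """D(i) = log2(20) + sum_k p_k(i) * log2 p_k(i) for one column."""
    counts = Counter(column)
    total = len(column)
    # Amino acids absent from the column contribute nothing to the sum.
    entropy_term = sum((n / total) * math.log2(n / total)
                       for n in counts.values())
    return math.log2(alphabet_size) + entropy_term


def logo_heights(aligned_sites):
    """Per-position letter heights, each letter scaled by its frequency."""
    heights = []
    for column in zip(*aligned_sites):
        info = column_information(column)
        counts = Counter(column)
        heights.append({aa: info * n / len(column)
                        for aa, n in counts.items()})
    return heights


# Three of the cleavage sites listed in Table 2, aligned on P1 (index 4).
sites = ["VATLQAENV", "GVNLQGGYV", "KIGLQAKPE"]
print(logo_heights(sites)[4])
```

The fully conserved P1 column reaches the maximum of log2(20) bits, while a column split evenly between two residues scores one bit less.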


Training the neural networks 

The artificial neural networks used in this work were of the 
standard feed-forward type. Sparse encoding was used for 
translating the amino acids to data input for the networks 
as has been described previously [36,10,37]. 
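Sparse encoding can be illustrated as below. This is a minimal sketch assuming 20 input units per residue and an all-zero vector for any symbol outside the standard alphabet; the cited papers may use a different unit count or a dedicated unit for blanks. The names `AMINO_ACIDS` and `sparse_encode` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an arbitrary assumption


def sparse_encode(window):
    """Encode each residue as 20 binary units, exactly one of them set."""
    vector = []
    for residue in window:
        unit = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:  # unknown symbols stay all-zero (assumption)
            unit[index] = 1
        vector.extend(unit)
    return vector


# A nine-residue window (P5 to P4') becomes 9 x 20 = 180 network inputs.
encoded = sparse_encode("VATLQAENV")
print(len(encoded), sum(encoded))
```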


Training was done with three-fold cross-validation and
Matthews correlation coefficients [38] were calculated by
summing up true positives, false positives, true negatives,
and false negatives over all combinations of training and test
sets. Using an architecture with two hidden neurons and
a symmetric window of nine amino acids centered on the
glutamine in the P1 position it was possible to obtain a
correlation coefficient of 0.84 on cross-validated test sets.
Care was taken to ensure that all cleavage sites were
equally distributed in every cross-validated set.
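The performance measures can be recomputed from the summed confusion matrix. The counts below are taken from the Results section (67 of 77 positives and 1,358 of 1,372 negatives classified correctly); the function names are illustrative and not part of NetCorona.

```python
import math


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a marginal total is zero
    return (tp * tn - fp * fn) / denominator


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Counts summed over the three cross-validation test sets.
tp, fn = 67, 10
tn, fp = 1358, 14
print(round(matthews_cc(tp, fp, tn, fn), 2))
print(round(100 * sensitivity(tp, fn), 1), round(100 * specificity(tn, fp), 1))
```

Because negatives outnumber positives roughly 18 to 1, the correlation coefficient is a more informative single figure than the percentage of correct predictions.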


Bayesian statistics 

The validity of the statistics depends on the expected fraction
of cleavage sites in a given data set, which we only
know for the data set at hand. Statistics were thus computed on
the data set test results in order to create a histogram of
prediction probabilities, using Bayes' Theorem:

\[ P(C_{pos} \mid X^i) = \frac{P(X^i \mid C_{pos}) \, P(C_{pos})}{P(X^i)} \]

The prediction outcome (0-1) was divided into 10 bins
(X^i) with increments of 0.1. The posterior probability
P(C_pos|X^i) gives the probability of a positive prediction
(that is, a cleavage) being true given the bin. This can be
calculated from the prior probability P(C_pos), which is the
fraction of positive examples in the data set, and the class-conditional
probability P(X^i|C_pos) for positive examples,
which is the fraction of positive examples in the bin X^i.
P(X^i) is the fraction of prediction outcomes in bin X^i.
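The binned posterior can be computed as sketched below. This is an illustration that follows the three quantities defined above term by term; the function names are hypothetical, and returning `None` for empty bins is an assumption.

```python
def bin_index(score, n_bins=10):
    """Index of the bin of width 1/n_bins that contains the score."""
    return min(int(score * n_bins), n_bins - 1)  # a score of 1.0 joins the last bin


def posterior_by_bin(scores, labels, n_bins=10):
    """P(C_pos | X^i) for every bin; labels are 1 (cleaved) or 0."""
    n_total = len(scores)
    n_pos = sum(labels)
    in_bin = [0] * n_bins
    pos_in_bin = [0] * n_bins
    for score, label in zip(scores, labels):
        b = bin_index(score, n_bins)
        in_bin[b] += 1
        pos_in_bin[b] += label
    prior = n_pos / n_total                    # P(C_pos)
    posteriors = []
    for b in range(n_bins):
        if in_bin[b] == 0 or n_pos == 0:
            posteriors.append(None)            # undefined for empty bins
            continue
        likelihood = pos_in_bin[b] / n_pos     # P(X^i | C_pos)
        evidence = in_bin[b] / n_total         # P(X^i)
        posteriors.append(likelihood * prior / evidence)
    return posteriors


# Made-up scores and labels, for illustration only.
scores = [0.05, 0.15, 0.95, 0.92, 0.55, 0.58]
labels = [0, 0, 1, 1, 1, 0]
print(posterior_by_bin(scores, labels))
```

Algebraically the result reduces to the fraction of examples in each bin that are true cleavage sites, which is why bins holding few examples give unreliable estimates.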


Searching for potential cleavage sites 

The average of the scores of all three networks arising
from the three-fold cross-validation was used for prediction.
Each network outputs a score in the range [0.000-1.000],
where scores below 0.5 indicate non-cleavage and
scores above 0.5 indicate potential cleavage. This method
is also employed by the prediction web server mentioned
below. The Swiss-Prot database [16] release 41.0 (February
2003) was downloaded and proteins from this
database were used as targets for the neural network
predictions.
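Scanning a protein with the averaged ensemble can be sketched as follows. The trained networks are not reproduced here, so three toy scoring functions stand in for them; these stand-ins, the function names, and the use of 'X' to pad windows near the sequence ends are all assumptions for illustration.

```python
def extract_window(sequence, q_index, n_before=4, n_after=4, pad="X"):
    """Return the P5..P4' window around the glutamine at q_index (0-based)."""
    left = max(0, q_index - n_before)
    window = sequence[left:q_index + n_after + 1]
    window = pad * (n_before - (q_index - left)) + window
    return window.ljust(n_before + n_after + 1, pad)


def predict_sites(sequence, networks, threshold=0.5):
    """Score every glutamine with the mean output of all networks."""
    hits = []
    for i, residue in enumerate(sequence):
        if residue != "Q":
            continue
        window = extract_window(sequence, i)
        score = sum(net(window) for net in networks) / len(networks)
        if score > threshold:
            hits.append((i + 1, round(score, 3)))  # 1-based P1 position
    return hits


# Toy stand-ins for the three trained networks (NOT the real models).
toy_networks = [
    lambda w: 0.9 if w[3] == "L" else 0.1,    # favours Leu at P2
    lambda w: 0.8 if w[5] in "SA" else 0.2,   # favours Ser/Ala at P1'
    lambda w: 0.6,
]
print(predict_sites("GPTLQARRRQ", toy_networks))
```

Averaging the three cross-validation networks uses all of the training data at prediction time without retraining a separate final model.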


Availability 

Our neural network based prediction method, NetCorona,
for prediction of potential cleavage sites of the SARS
3CLpro proteinase is publicly available by following the
link 'CBS prediction servers' from http://www.cbs.dtu.dk
or at this specific URL: http://www.cbs.dtu.dk/services/NetCorona/